
Add multimodal embedding & rerank support #66

Draft

roj234 wants to merge 2 commits into JamePeng:main from roj234:vl-embedding

Conversation

roj234 commented Feb 21, 2026

It works, but it duplicates logic: llama_chat_format already implements multimodal input, but it does not support embedding models like Qwen-VL-Embedding.
This code borrows heavily from llama-server's C++ implementation (ServerTokens).

JamePeng (Owner) commented Feb 21, 2026

It's best to create a multimodal Embedding class in llama_embedding.py, or enhance the existing Embedding class, to manage mctx. There's no need to add unnecessary memory usage to Llama. Remember to release the memory after the new mctx is used.
If possible, please provide the necessary example and test code to illustrate its usage.
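The cleanup pattern being asked for here can be sketched standalone. This is a minimal illustration, not the actual bindings: `MtmdContext` and `mtmd_free` are hypothetical stand-ins, and the "handle" is a plain dict so the lifecycle is observable without loading a model. The `ExitStack` mirrors the `context_stack` idea and `__del__` is the safety net.

```python
import contextlib

def mtmd_free(handle):
    # Hypothetical stand-in for the real mtmd free call.
    handle["freed"] = True

class MtmdContext:
    """Owns a (fake) mctx handle and guarantees it is released exactly once."""

    def __init__(self):
        self.handle = {"freed": False}
        # Mirror the context_stack pattern: register cleanups on acquisition.
        self._stack = contextlib.ExitStack()
        self._stack.callback(mtmd_free, self.handle)

    def close(self):
        # Idempotent: ExitStack runs each callback at most once.
        self._stack.close()

    def __del__(self):
        # Safety net if the caller forgets close().
        self.close()

ctx = MtmdContext()
ctx.close()
# ctx.handle["freed"] is now True; a second close() is a no-op.
```

Registering the free callback at the moment the resource is created (rather than in a distant `finally`) is what makes the pattern robust against early returns and exceptions.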

roj234 (Author) commented Feb 21, 2026

Actually I am enhancing the existing Embedding class; however, I can move the mctx management to llama_embedding.py.
Regarding memory, I followed your context_stack and __del__ patterns to free it.
I also found that llama_chat_format contains the logic for multimodal processing, but it is tightly coupled with inference execution and does not expose a way to get the processed tokens.

By the way, here is my usage:

doc = [{"type": "text", "text": f"Name: {filepath.name}"},
       {"type": "image", "image": image_data}]
class RAGModel:
    def __init__(self):
        self._model = LlamaEmbedding(
            # ...
            mmproj_path=...,
            image_min_tokens=...,
            image_max_tokens=...,
        )

    def _tmpl(self, contents: List[Dict[str, Any]], instruct: str):
        files = []

        image_id = 0
        # Should not manually concat the chat template here...
        tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
        for item in contents:
            item_type = item['type']
            if item_type == 'text':
                tmpl += item['text']
            elif item_type == 'image':
                image_id += 1
                files.append(item['image'])
                tmpl += f"Picture {image_id}: <__media__>"  # <__media__> is the media placeholder in mtmd

        return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

    def embed_document(self, contents: List[Dict[str, Any]], instruction: str = "Represent the user's input.", return_count: bool = False) -> List[float]:
        text, files = self._tmpl(contents, instruction)
        return self._model.embed_multimodal(text, files, return_count=return_count)
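The prompt-building part of the snippet above is pure string work and can be exercised without a model. The sketch below extracts it as a free function (the `build_prompt` name is illustrative; the chat-template literals and the `<__media__>` placeholder are taken from the snippet):

```python
from typing import Any, Dict, List, Tuple

def build_prompt(contents: List[Dict[str, Any]], instruct: str) -> Tuple[str, list]:
    """Interleave text with numbered <__media__> placeholders, collecting media."""
    files = []
    image_id = 0
    tmpl = f"<|im_start|>system\n{instruct}<|im_end|>\n<|im_start|>user\n"
    for item in contents:
        if item["type"] == "text":
            tmpl += item["text"]
        elif item["type"] == "image":
            image_id += 1
            files.append(item["image"])
            tmpl += f"Picture {image_id}: <__media__>"
    return tmpl + "<|im_end|>\n<|im_start|>assistant\n", files

prompt, files = build_prompt(
    [{"type": "text", "text": "Name: cat.png"},
     {"type": "image", "image": b"\x89PNG..."}],
    "Represent the user's input.",
)
# One placeholder is emitted per collected media item.
assert prompt.count("<__media__>") == len(files) == 1
```

The invariant worth testing is the one mtmd relies on: the number of `<__media__>` markers in the prompt must equal the number of media buffers passed alongside it.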

JamePeng (Owner) commented
Currently there is indeed a lack of a multimodal class, analogous to Llama or the sampler classes, to abstract the mtmd_cpp API. The heavyweight and complex llama_chat_format implementations inherited from llama 1.5 are indeed difficult to manage.

(cherry picked from commit 4ba212f)
roj234 (Author) commented Feb 24, 2026

With from llama_cpp.mtmd import Jinja2MultimodalChatFormatter, RAGModel can become:

    def __init__(self):
        eos_token_id = self._model.token_eos()
        bos_token_id = self._model.token_bos()

        eos_token = (
            self._model._model.token_get_text(eos_token_id) if eos_token_id != -1 else ""
        )
        bos_token = (
            self._model._model.token_get_text(bos_token_id) if bos_token_id != -1 else ""
        )

        self._formatter = Jinja2MultimodalChatFormatter(
            template=self._model.metadata['tokenizer.chat_template'],
            eos_token=eos_token,
            bos_token=bos_token,
            stop_token_ids=[eos_token_id]
        )

    def _tmpl(self, contents: List[Dict[str, any]], instruct: str):
        result = self._formatter([{
            "role": "system",
            "content": instruct
        }, {
            "role": "user",
            "content": contents
        }])

        return result.prompt, result.medias

Contents can be image or audio; local disk paths, network URLs, and bytes/bytearray instances are supported, but there is no video support yet. I also think create_completion is too complex, so I will create an alternate function instead (to avoid a breaking change).
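A hedged sketch of how the three supported media sources (bytes/bytearray, local path, URL) might be normalized to raw bytes. `load_media` is an illustrative helper, not the formatter's actual API; the URL branch uses stdlib `urllib` only to show the dispatch shape.

```python
import pathlib
import urllib.request

def load_media(source) -> bytes:
    """Normalize a media reference (bytes, local path, or URL) to raw bytes."""
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)  # already in-memory data
    if isinstance(source, (str, pathlib.Path)):
        text = str(source)
        if text.startswith(("http://", "https://")):
            # Network URL: fetch the body.
            with urllib.request.urlopen(text) as resp:
                return resp.read()
        # Otherwise treat it as a path on local disk.
        return pathlib.Path(text).read_bytes()
    raise TypeError(f"unsupported media source: {type(source)!r}")
```

Dispatching on the reference type at the boundary keeps the formatter itself dealing with one canonical representation (bytes) regardless of where the media came from.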

JamePeng (Owner) commented

Hi @roj234, this PR can keep being adapted and optimized. I first need to refactor the batch decode and eval parts: the old execution logic has misalignment issues that cause the KV cache to get out of sync after newer models run their first round. Since the changes from ggml-org/llama.cpp@2b6dfe8 land on top of this, I will simply refactor along the lines of llama.cpp's current, newer approach. This will disrupt the Embedding part somewhat, but it should be worth it.

roj234 (Author) commented Feb 24, 2026

OK. My planned changes: besides the added LlamaEmbedding.embed_multimodal function, I will create a function similar to Llama.create_multimodal_chat_completion that can directly handle image/audio objects in a request, or video objects in the future. (I looked at Qwen VL's code; its video implementation uses ffmpeg to slice the video into an image sequence at n FPS, though new approaches may appear later, in which case it will depend on how the mtmd library implements them.)
Of course, I think the best approach (meaning deleting outdated code, at the cost of backward compatibility) would be to refactor create_chat_completion and remove the thousands of lines of templates and historical baggage in llama_chat_format, for example by turning them into named Jinja templates under a template directory, which would also fix the strange Llama->chat_format->Llama call chain.
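For reference, the Qwen-VL-style video handling mentioned above (slice a video into a frame sequence at n FPS with ffmpeg) amounts to a command like the one built below. This only constructs the argument list; it does not assume ffmpeg is installed, and the function name and output pattern are illustrative.

```python
def ffmpeg_frame_cmd(video_path: str, out_dir: str, fps: int = 2) -> list:
    """Build an ffmpeg command that extracts `fps` frames per second as PNGs."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",          # sample at the requested frame rate
        f"{out_dir}/frame_%04d.png",  # numbered output frames
    ]
```

The resulting PNG sequence could then be fed through the existing image path, which is why this decomposition avoids touching the embedding code at all.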
